Loading Data Into R
ASI: Introduction to R
Data In R
- Working with data in R is very different to Excel
- Can have complicated structures or be very simple (e.g. x <- 1:5)
- Spreadsheet-like data is very common
  - The R equivalent is known as a data.frame
  - Has many variants, e.g. tbl_df or tibble (SQL-inspired)
  - We’ll mainly use the tibble variant today
- We import the data as an R object
  - All analysis is performed on the R object
  - Almost never modify the source file
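As a minimal sketch of the two extremes mentioned above, the following base R code builds a very simple vector and a small spreadsheet-like data.frame. The object `df` and its values are made up purely for illustration.

```r
# A very simple structure: an integer vector
x <- 1:5

# A spreadsheet-like structure: each column is a vector of equal length
# (df and its contents are hypothetical, for illustration only)
df <- data.frame(
  id = 1:3,
  group = c("a", "b", "a")
)

length(x)  # number of elements in the vector
dim(df)    # number of rows and columns in the data.frame
class(df)  # the type of object we are working with
```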
Importing Data
- Cell formatting will be ignored by R
- Plots will also be ignored
- Blank rows are not fatal, just annoying
- Mixtures of numbers and text in a column cause problems
  - data.frames are structured with vectors as columns, and a vector holds a single type
- Deleted cells are sometimes imported as blank rows/columns
- Comma-separated or tab-separated files are favoured for R
  - i.e. plain text, or just the data
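To see why a mixture of numbers and text in one column matters, here is a small sketch using `read.csv()`, which ships with R. The column names and values are hypothetical, and the data is supplied as inline text rather than a real file.

```r
# Two hypothetical columns, supplied inline as plain text rather than a file
csv_text <- "clean,mixed
1.5,2.5
3.0,unknown
4.5,6.0"

# read.csv() comes with R (utils package); text = reads from a string
d <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(d$clean)  # every entry is a number, so the column is numeric
class(d$mixed)  # one text entry makes the whole column character
```

A single stray text entry is enough to stop the whole column being treated as numbers, which is why cleaning such columns before import is usually worthwhile.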
Other Common Excel Issues
- Excel thinks everything is a date:
  - Septin genes are now officially named SEPTIN1, not SEPT1, etc. (Ziemann, Eren, and El-Osta 2016)
Preparation
- File > New File > R Script (or Ctrl+Shift+N)
  - Save as DataImport.R
Import Using the GUI
Importing Data
- Preview the file pigs.csv by clicking on it (View File)
  - Try in Excel if you prefer, but DO NOT save anything from Excel
- The data measures tooth (i.e. odontoblast) length in guinea pigs
- Using 3 dose levels of Vitamin C (“Low”, “Med”, “High”)
- Vitamin C was given in drinking water or using orange juice
- “OJ” or “VC”
Using the GUI To Load Data
Click on the pigs.csv and choose Import Dataset then stop!
(Click Update if you don’t see this)
The Preview Window
- This is another preview of the data before we import it
- There are 3 columns: len, supp and dose
  - len is a double (numeric)
  - The other two are character columns
What just happened?
The code we copied has 3 lines:
- Loads the package readr using library(readr)
  - Packages are collections (i.e. libraries) of related functions
  - All readr functions are about importing data
- readr contains the function read_csv()
  - read_csv() tells R what to do with a csv file
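For reference, the three lines usually look something like the sketch below. The path RStudio writes depends on where pigs.csv is saved, so this version assumes nothing about your folders: it writes a small stand-in file to a temporary location, using only the four rows that head(pigs, 4) prints later in this session. It also assumes readr is installed.

```r
library(readr)  # line 1: load the package (assumes readr is installed)

# Stand-in for pigs.csv: only four rows, written to a temporary file
pigs_file <- tempfile(fileext = ".csv")
writeLines(
  c("len,supp,dose",
    "4.2,VC,Low",
    "11.5,VC,Low",
    "7.3,VC,Low",
    "5.8,VC,Low"),
  pigs_file
)

# Line 2: import the file and store it as the R object pigs
pigs <- read_csv(pigs_file)

# Line 3: open the preview; only meaningful in an interactive session
if (interactive()) View(pigs)
```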
Let’s Demonstrate
- In the Environment Tab click the broom icon
  - This will delete everything from your R Environment
  - It won’t unload the packages
- Select the code we’ve just pasted and send it to the console
  - Reloading the packages won’t hurt
- Check the Environment Tab again and pigs is back
- You can delete the line View(pigs)
Realistically we only need to preview it the first time. Having that preview open every time actually ends up being really annoying
Data Frame Objects
- The object pigs is known as a data.frame
  - Very similar to an SQL table
- R equivalent to a spreadsheet
  - Missing values (blank cells) are usually filled with NA
  - Must have column names \(\implies\) row names becoming less common
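A quick sketch of how blank cells turn into NA on import. The data below is hypothetical and supplied as inline text, and it uses base R's `read.csv()` so that it runs without extra packages.

```r
# Hypothetical data with one blank cell in the len column
csv_text <- "len,supp
4.2,VC
,OJ
7.3,VC"

d <- read.csv(text = csv_text, stringsAsFactors = FALSE)

d$len         # the blank cell has been imported as NA
is.na(d$len)  # TRUE wherever a value is missing
colnames(d)   # column names are taken from the header row
```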
Tibble Objects
- readr uses a variant called a tbl_df or tbl (pronounced tibble)
  - A data.frame with nice bonus features (e.g. prints a summary only)
  - Similar to a SQL table
  - Can only have row numbers for row names
  - Is a foundational structure in the tidyverse
The Tidyverse
- The tidyverse is a collection of thematically-linked packages
  - Produced by developers from RStudio/Posit
  - Often referred to as tidy-programming or similar
- Calling library(tidyverse) loads all of these packages
  - \(>\) 10 convenient packages in one line
- readr is one of these \(\implies\) usually just load the tidyverse
library(tidyverse)
Functions
Functions in R
head(pigs)
glimpse(pigs)
- Here we have called the functions 1) head() and 2) glimpse()
  - They were both executed on the object pigs
- Call the help page for head()
?head
(if you get multiple options, choose the one from utils)
Function Arguments
- head() prints the first part of an object
  - Useful for very large objects (e.g. if we had 1000 pigs)
- We can change the number of rows shown to us
head(pigs, 4)
# A tibble: 4 × 3
    len supp  dose
  <dbl> <chr> <chr>
1   4.2 VC    Low
2  11.5 VC    Low
3   7.3 VC    Low
4   5.8 VC    Low
Understanding read_csv()
- Earlier we called the R function read_csv()
- Check the help page
?read_csv
- We have four functions shown but stick to read_csv()
Closing Comments
read_csv() Vs read.csv()
- RStudio now uses read_csv() from readr by default
- You will often see read.csv() in older scripts (from utils)
- The newer (readr) version is:
  - slightly faster
  - more user-friendly
  - creates fewer issues
  - gives informative messages
- Earlier functions in utils are read.*() (csv, delim etc.)
- readr has the functions read_*() (csv, tsv, delim etc.)
- I always use the newer ones
Reading Help Pages: Bonus Slide
- The bottom three functions are simplified wrappers to read_delim()
- read_csv() calls read_delim() using delim = ","
- read_csv2() calls read_delim() using delim = ";"
- read_tsv() calls read_delim() using delim = "\t"
What function would we call for space-delimited files?
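One possible answer, sketched below, is to call read_delim() directly and set delim ourselves. The example first reads a comma-separated file two ways to show the wrapper relationship, then reads a space-delimited file. Both files are hypothetical and written to temporary locations, and the sketch assumes readr is installed.

```r
library(readr)  # assumes readr is installed

# Hypothetical comma- and space-delimited versions of the same small data set
comma_file <- tempfile(fileext = ".csv")
space_file <- tempfile(fileext = ".txt")
writeLines(c("len,supp", "4.2,VC", "11.5,OJ"), comma_file)
writeLines(c("len supp", "4.2 VC", "11.5 OJ"), space_file)

# read_csv() is read_delim() with the delimiter fixed to a comma
via_csv   <- read_csv(comma_file)
via_delim <- read_delim(comma_file, delim = ",")

# For space-delimited files, set the delimiter to a single space
via_space <- read_delim(space_file, delim = " ")

# The parsed columns match across all three calls
identical(via_csv$len, via_delim$len)
identical(via_csv$supp, via_space$supp)
```

This assumes exactly one space between fields; files padded with a variable number of spaces may need a different approach.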
Loading Excel Files
- The package readxl is for loading .xls and .xlsx files
- Not part of the core tidyverse but very compatible
library(readxl)
- The main function is read_excel()
?read_excel
References
Ziemann, Mark, Yotam Eren, and Assam El-Osta. 2016. “Gene Name Errors Are Widespread in the Scientific Literature.” Genome Biol. 17 (1): 177.